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ABSTRACT 


A  characteristic  shared  by  many  computation  intensive  algorithms  is  the  repeated  usage  of  a  few 
data  values  in  a  sequence  of  computations.  An  efficient  parallel  implementation  of  these  data 
dependences  often  requires  the  simultaneous  transfer,  or  broadcatting,  of  the  data  values  to  all  the 
processors  that  need  them.  Unfortunately,  direct  realisation  of  this  broadcasting  operation  on  VLSI 
processor  arrays,  especially  on  systolic  arrays,  usually  results  in  severe  performance  degradation. 

A  technique  for  decomposing  broadcasting  dependences  into  propagation  dependences  at  the 
algorithm  level  is  presented  in  this  paper.  Such  propagation  dependences,  when  physically  realized, 
result  in  pipelining.  The  determination  of  a  feasible  propagation  scheme  is  formulated  as  a  linear 
algebra  problem.  Wo  pee  re  that  all  broadcastings  can  be  decomposed  into  propagations  and  we 
propose  a  systematic  method  for  finding  such  decompositions. 
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1.  INTRODUCTION 


A  major  obstacle  to  the  efficient  implementation  of  algorithms  on  processor  arrays  with 
distributed  memory  is  that  random  data  access  in  constant  time  is  not  possible.  The  data  access  time 
in  a  processor  array  depends  on  the  physical  distance  between  the  processors  which  generate  and  use 
the  data,  which  is  a  function  of  the  global  interconnection  pattern  used  for  data  routing. 

A  characteristic  shared  by  many  computation  intensive  algorithms  is  the  repeated  usage  of  input 
data  or  of  intermediate  results  in  sequences  of  computations,  see,  e.g. ,  the  algorithms  in  |l|.  While 
this  kind  of  data  dependence  poses  no  problem  in  a  uniprocessor  environment,  a  minimal  time  parallel 
implementation  on  an  ideal  machine  with  negligible  communication  cost  would  often  require  the 
simultaneous  transfer,  or  broadcasting,  of  data  to  all  the  processors  which  use  them  in  subsequent 
computations.  Simulation  of  the  broadcasting  operation,  however,  usually  results  in  severe 
performance  degradation  in  array  implementations  because  the  I/O  bandwidth  between  the  processor 
array  and  the  host  is  limited,  and  the  fan-out  degree  of  the  processors  is  bounded. 

There  are  two  commonly  known  solutions  to  the  broadcasting  problem  on  processor  arrays: 
architectural  transformation  and  algorithmic  transformation.  Architectural  transformation  is  based  on 
the  retiming  procedure  proposed  by  Leiserson  and  Saxe  [2]  for  inserting  delays  in  the  broadcasting 
paths  on  the  array,  effectively  replacing  data  broadcasting  with  pipelining,  see  the  example  in  [3]. 
While  this  approach  is  systematic  and  well  formulated,  the  derivation  of  the  broadcast-free  array  has 
a  complexity  proportional  to  the  product  of  the  number  of  processors  and  the  number  of 
interconnections  in  the  array.  The  algorithmic  transformation  approach,  on  the  other  hand, 
eliminates  broadcasting  at  the  algorithm  level.  A  powerful  type  of  transformation  is  the 
decomposition  of  broadcasting  dependences  into  propagation  dependences  |4],  which,  when  physically 
realized,  result  in  pipelining.  The  decomposition  approach  deals  with  broadcasting  at  a  higher  level 
than  architectural  transformation  and  hence,  may  result  in  parallel  implementations  of  better 
performance. 

To  illustrate  the  decomposition  approach,  consider  the  1-D  recursive  filtering  algorithm.  The 
output  signal  y  is  computed  from  the  input  signal  z  and  two  sets  of  weights  w  and  r  according  to 
the  equation 

*  k 

Vi  —  £  WjZi.j  +  2  fjVi-j  ,  k  <  i  <  n  ,  k  «  n  . 

/= i  ;=i 

The  equation  may  be  expressed  in  single  assignment  form: 

v(«.  ;  )  —  v(*\  j  +1)  +  /)*(* -1.  ;-l)  +  r(a--l.  j  )y  (»-./•  1) 

©(*,;')*  ®(i-i,  /) 
z(t ,  j  )  =  z(t-l,  ;-l) 
r(»,  j)  —  r(»-l,  j), 
with  boundary  conditions 

V  (•',*+ 1)  *  0,  w[k -l,j )  —  Wj,  r{k-l,j)  —  r,, 

?(*-!,;')  —  Zk-j-i  .  *M)  =  *,  • 

The  final  results  are  y,  =  y(»,  1).  Each  computed  value  y(f ,  1)  is  needed  by  several  computations 


of  tt  (*  i  J )'.  the  valve  of  y  (/ ,  1)  has  to  be  made  available  to  all  the  computations  y(i,  j),  where 
i-j  mm  I.  The  data  dependences  associated  to  this  broadcasting  are  shown  in  Figure  la,  in  which 
*  «  6  and  /  -  8. 

To  eliminate  the  broadcasting,  we  introduce  a  propagation  variable,  y ,  defined  by  the 
conditional  recurrence 

..  f  F(*-l,i-l)  ifj  ^l 

F(* .  J  I  \  p(i_j,y)  otherwise’ 

and  replace  the  dependence  of  y  (i ,  j )  on  y  (i  -j  ,1)  with  the  dependence  on  y  (» ,  j ), 

v(». ;)-  »(».  y+i)  +  y)F(*-i,  y-i)  +  f(*-i,  y )y (» ,  y). 

The  function  of  y(i ,  j)  is  to  transfer,  or  propagate,  the  value  of  y(i-j,  1)  to  all  the  computations 
that  use  it.  For  example,  the  direct  dependence  of  computation  if  (12,4)  on  the  value  of  y  (8,1)  is  now 
replaced  by  the  dependence  path 

if  (12,4)  -  y  (12,4)  -  F(H,3)  -  y  (10,2)  -  y(9,l)  -  y(8,l)  . 

N'-te  that  all  the  y  (« ,j  )’s  along  the  dependence  path  need  value  y(8,l)  and  that  the  propagation 
variable  y  brings  them  that  value.  With  propagation,  the  broadcasting  dependences  in  Figure  la  are 
decomposed  into  the  propagation  dependences  shown  in  Figure  lb. 

The  idea  of  replacing  broadcasting  with  propagation  is  not  new.  Many  innovative  processor 
array  implementations  have  been  derived  through  the  use  of  such  a  transformation:  algebraic  path 


• - >•  :  y  on  y 


Figure  la  i  Broadcasting  dependences  on  the  value  of  y  (8, 1)  in  the  1-D  recursive  filtering  example. 
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problem  |5],  LDU  decomposition  of  matrices  [4),  dynamic  programming  [6],  etc  .  Without  a  rigid 
formalism,  however,  algorithm  transformations  must  be  carried  out  through  tedious  data  dependence 
analyses,  which  can  handle  in  practice  only  the  simplest  forms  of  broadcasting. 

In  this  paper,  we  formulate  the  decomposition  of  broadcasting  dependences  into  propagation 
dependences  and  present  a  systematic  method  for  determining  the  appropriate  decomposition.  In 
Section  2,  the  algorithm  model,  upon  which  our  discussion  is  based,  is  presented.  In  Section  3 
propagation  is  defined  mathematically  and  all  broadcastings  are  proved  to  be  decomposable  into 
propagations.  In  Section  4,  a  special  case  of  broadcasting,  in  which  the  data  to  be  broadcast  come 
from  external  sources  (input  data),  is  considered.  Open  problems  and  current  research  in  the  area  are 
discussed  in  Section  5. 


1.  ALGORITHM  MODEL 

The  Linear  Dependence  Algorithms  (LDA’s),  are  a  generalization  of  the  Regular  Iterative 
Algorithms  (RIA's)  (4]  and  Uniform  Recurrence  Equations  (URE's)  [7].  Ai:  LDA  is  expressed  as  a  set 
of  r  recurrence  equations  defined  within  an  n -dimensional  index  space  called  the  computation  domain. 
C.  The  data  dependent  s  among  the  computations  of  an  LDA  are  linear  (or  affine)  functions  of  the 
coordinates  of  the  computation  points. 


•.(/>)-  fii*iS4‘>Sp ))'  •  •  ) ;  1  <  »'  <  r ,  jm  €  tr ,  P  €  c, ,  C  c . 

The  computed  variable* ,  ,  1  <  »  <  r ,  represent  the  quantities  computed  by  the  LDA,  and 

the  J i'%  are  the  functions  evaluated  ia  order  to  compute  these  values.  Note  that,  unlike  RIA’s,  in 
which  the  full  set  of  the  r  computed  variables  is  computed  at  every  iadex  point  of  the  computation 
domain  (except,  possibly,  for  index  points  on  the  boundary  of  the  domain),  LDA’s  feature  selective 
computations  of  subsets  of  these  r  variables  at  difereat  iadex  locations  throughout  the  domain: 
variable  s,  is  computed  at  all  the  points  ia  the  subset,  C< ,  of  C.  The  input  varitblee  ,  a , ,  >  r , 

of  the  LDA  are  data  values  supplied  to  the  LDA  from  external  sources. 

The  dependence  mapping,  dt) ,  is  an  affine  mapping  which  defines  the  dependence  between  the 
indices  of  the  computed  variable  s< ,  and  the  indices  of  the  variable  s, ;  the  computation  of  a,  (P ) 
requires  the  use  of  the  quantity  s,  (4,  (/*)).  A  dependence  mapping  comprises  a  linear  part,  B,  which 
is  an  m  X  n  integral  matrix,  and  a  constant  part,  A<,- ,  which  is  a  constant  integral  column  m  -vector, 

dij(P)  -  Bjj  P  -  A0  . 

where  the  value  of  m  is  n  if  /  <  r ,  i.e.,  if  s,  is  also  a  computed  variable,  and  satisfies 
1  <  m  <  n  ,  otherwise.  Here,  we  assume  that  all  the  computed  variables  are  fully  indexed,  i.e.,  each 
of  them  is  indexed  by  an  n  -vector.  This  can  be  achieved  by  introducing  one  or  more  addi’ional 
indices  to  those  computed  variables  which  are  not  already  fully-indexed.  There  are  simple  heuristics 
for  adding  these  missing  indices  [8|. 

The  dependence  vector,  vt](P ),  associated  to  the  dependence  mapping  di} ,  i,j  <  r,  at  an 
index  point  P  at  which  s,  is  computed,  is  given  by 

v<iiP)  “  p  -  d',lp B,t)P  +  A„ 

where  /„  is  the  order  n  identity  matrix.  A  dependence  mapping  dt) ,  ij  <  r  is  a  Irantlation  if 
BtJ  ■»  /,  and  then 

t-„(P)-A0  ypec,. 

The  RIA's  and  URE's  are  particular  cases  of  LDA's  in  which  all  the  dependence  mappings  are 
translations  and  C,  »  C,  1  <  i  <  r  .  The  dependence  vectors  are  constant  in  these  cases. 


3.  BROADCASTING 

In  this  section,  we  formally  define  the  decomposition  technique  for  the  elimination  of 
broadcasting  in  LDA’s.  Two  kinds  of  propagation  are  considered,  elementary  and  composite 
propagation.  In  elementary  propagation,  the  propagation  dependence  paths  are  constructed  from 
elementary  (or  canonical)  vectors.  We  present  a  necessary  and  sufficient  condition  under  which  a 
broadcasting  dependence  is  decomposable  into  elementary  propagation.  A  polynomial  time  algorithm 
for  constructing  the  propagation  variables  is  given.  Composite  propagation  is  the  generalization  of 
elementary  propagation,  with  propagation  dependence  vectors  that  are  not  necessary  canonical.  We 
show  that  all  broadcasting  dependences  can  be  decomposed  into  composite  propagations  In  this 
section,  we  consider  only  the  broadcasting  of  the  values  of  computed  variables  in  a  LDA.  The 
broadcasting  of  input  variables,  which  is  a  special  case,  will  be  discussed  in  Section  4. 
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1.1.  Eknmtary  Propagatloa 

An  LDA  requires  broadcasting  if  the  value  of  a  computed  variable  is  needed  by  several  other 
computations  of  the  same  or  a  different  variable.  This  can  be  detected  easily  from  the  recurrence 
definitions  of  the  LDA: 

Dufinttloni 

A  dependence  mapping  d,t ,  where  s,  and  t}  are  computed  variables  of  the  LDA,  is  a 
broadcaeting  dependence  mapping  if  the  linear  part  of  d,t ,  the  n  X  n  integral  matrix  Bt] ,  is 
rank  deficient. 

The  computation  of  s,  at  point  P  uses  the  value  of  a}  computed  at  point  d,t(P ).  If  B,,  is  rank 
deficient  then  there  exist  Px,  P2,  ■  •  •  ,  such  that  ii;(Pt)  *»  dtJ  (P2)  »•■■■«  P0,  say.  The 
computations  of  4  at  points  Pu  P2,  •  -  •  ,  all  need  the  value  of  a}  computed  at  point  P0>  hence  the 
value  of  Sj(P«)  must  be  broadcast  to  all  these  points.  For  convenience,  we  drop  the  subscripts  and 
rename  a,  as  a,  a}  as  b,  and  d,t  as  d. 

The  basic  idea  of  elementary  propagation  is  as  follows.  We  associate  to  each  broadcasting 
dependence  mapping,  d,  a  set  of  propagation  variable*,  7t|,  7t  ,  •  •  •  ,  where  {kv  t2,  •  •  •  ,  kn  }  is 
a  permutation  of  the  set  of  integers  {1,  ■  ■  • ,  n  },  for  transferring  the  broadcast  value,  b,  to  the  index 
points  at  which  b  is  needed.  Each  of  these  propagation  variables  is  responsible  for  propagation  along 
one  elementary  direction,  et  ,  where  is  a  canonical  basis  vector.  The  portion  of  dependence  path 
in  the  direction  of  is  called  the  g  th  tection  of  the  overall  propagation  dependence  path. 

Let  P0  be  the  index  point  at  which  the  value  of  b  to  be  broadcast  is  computed  and  assume  that 
the  computation  of  a  at  P,  needs  the  value  of  >(P9),  hence,  d(P,)  ■*  P0.  The  original  data 
dependence 

•  (Pi)  —  6(rf(P,))  —  fi(/,0) 

is  replaced  by  a  sequence  of  propagation  dependences  starting  with 

«(P«)-  *»,(P .)• 

The  variable  fi*  performs  propagation  in  the  direction  of  ?t|,  the  fcjth  elementary  direction.  The 
objective  is  to  create  a  dependence  path  from  the  current  index  point,  P,  in  this  case,  to  the  index 
point  P3  whose  A  ,th  component  equals  the  k,th  component  of  P0,  the  index  point  at  which  the 
required  b  is  computed.  The  first  section  of  the  propagation  path  is  therefore, 

« (Pi)  —  —  ^.(Pi+'i?*,)  —  2*,(Pi+2t,?tl)  —  -  3*,(P2-«i^W1)  • 

where  <,  »  1  or  -1  or  0  is  the  tenee  of  propagation  in  the  direction  .  The  second  section  of  the 
propagation  path,  which  extends  in  the  direction  of  tt),  begins  at  P2  with  propagation  variable  atj. 
Propagation  in  this  direction  is  continued  until  the  index  point  P3  whose  fc2t h  component  equals  that 
of  P0,  is  reached, 

«(Pi)  —  3*, (Pi)  —  •  •  st.lPr-ti?*,)  —  ®*,(P 2)  -  «»,( P?  +  ‘r**,)  -  -  atj{P3  -  c2?»s) 

Since  propagation  in  the  second  section  does  not  change  the  t^h  component,  the  k,th  component  of 
Pj  equals  the  *,th  component  of  P0.  Repeating  this  procedure  for  each  of  the  elementary  directions, 
the  final  dependence  path  is 
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•mi+M  nm  ns  t.  > » at »  im 


o  (Pi)  -»  Sik^l)  •••  “*  ->5*.(/>n)“*  '••  “*  3k,  (Pn  +l-£»  \  )  • 

If  we  let  b  be  7t>  ,  this  dependence  path  is  a  decomposition  of  the  original  broadcasting  dependence 
of  a  (Pi)  on  b(P 0),  since  Pm+l  s  P0.  The  introduction  of  the  propagation  variables  does  not  modify 
the  precedence  relationship  between  the  computations  of  a  and  b  at  index  points  within  the 
computation  domain,  hence  the  computability  of  the  transformed  LDA  is  guaranteed.  The  feasibility 
of  the  elementary  propagation  scheme,  however,  depends  on  the  existence  of  the  appropriate 
propagation  variables  .for  performing  the  required  propagations. 

In  general,  the  propagation  variables,  ,  1  <  q  <  n ,  are  defined  by  conditional  recurrences 
of  the  form 


I*,(P) 


2*,(P  +  3k,  )  if  (P  +  3k,  H*,  <  P  0*, 

5»,(P  -  ?*,)  if  (P  -3*,  **,  >  Pot, 

3kf<1(P  +  3^)  if  (P  +  ?t'  Us,  =  Pot,  .  V  P  < 
Vi(P  ~  7*f )  »f  (P  -  3k,  It,  *  Pot, 

V,(P)  ^  (P  lit,  -Pot, 


(1) 


where  (P  +  is  the  kt th  component  of  the  vector  sum  (P  +  ?),  and  p0t,  is  the  kt th  component 
of  P0,  the  index  point  at  which  the  required  data  value  b  is  located.  The  first  two  conditionals 
determine  the  sense  (either  positive  or  negative)  of  propagation  and  the  remaining  conditionals  define 
the  terminating  conditions  of  the  q  th  section.  The  value  of  p0*  depends  on  the  starting  index  point, 
Plt  of  the  dependence  path.  Since  Px  is  arbitrary,  and  hence  P0  is  arbitrary,  it  must  be  possible  to 
determine  the  value  of  p  from  the  current  index  point  P  and  the  constant  part,  A,  of  d  for  the 
propagation  scheme  to  work  with  only  one  set  of  propagation  variables  for  all  P  i's. 

From  (1),  the  index  points  P  on  the  q  th  section  of  the  propagation  path  have  the  property  that 
their  k,  th  component  is  equal  to  the  k,  th  component  of  P0  for  r  <  q  ,  and  to  the  kr  th  component  of 
P ,  for  r  >  q  , 


Po*-  ,  r  <  q 
Pit,  -  r  >  7 


These  components  are  called  the  invariant  components  of  the  q  tb  section  of  the  propagation  path. 
Hence,  if  d  is  such  that  the  kt  th  component  of  Pn  («  (/*  j ))  can  be  determined  from  A  and  the 

invariant  components  of  the  q  th  section,  1  <  q  <  n,  the  elementary  propagation  scheme  is 
applicable.  In  other  words,  if  there  exists  an  order  of  propagation,  i.e.,  an  assignment  of  1,  •  •  •  ,  n  to 
k ,,  k j,  such  that  the  above-mentioned  condition  is  satisfied,  then  the  broadcasting 

.  dependence  can  be  decomposed  into  elementary  propagation  with  the  given  ordering.  To  minimize 

the  computation  cost  of  the  conditionals,  the  fcy  tb  component  of  P0  is  constrained  to  be  a  linear 
function  of  the  invariant  components  and  of  A. 

The  condition  under  which  a  broadcasting  dependence  mapping  is  decomposable  into  elementary 
propagation  can  be  formulated  in  simple  mathematical  terms.  Assume  for  the  moment  that  the  >  th 
I  row  of  Pis  different  from  ?,T  for  all  «  The  broadcasting  dependence  is 

i 


v*  >.*.«  v  ,* v 


Suppose  the  natural  ordering  is  a  feasible  propagation  order,  i.e.,  k,  =  q ,  1  <  q  <  n.  The 
assumption  that  the  q  th  component  of  P9  is  a  linear  function  of  A,  of  the  first  q  -1  components  of  P0 
and  of  the  last  n  -q  components  of  within  the  q  th  section,  implies  that 


/«■  1  /=*♦!  >=> 


where  ,  u„  and  ztJ  are  rational  numbers.  Writing  the  above  expression  for  q  =  1,  ■  •  •  ,  n  in 
matrix  notation  and  grouping  the  components  of  P 0  and  P ,  on  opposite  sides,  we  get 

1 

l2i  1  0 

*31  *32  ' 

*•1  *•  2  ‘  ‘  *■•-!  1 

This  implies  that  P0  L~X[UPX  -  X A),  hence 
B  =»  ,  and  A'  —  L  , 


P  01 

0  “12  “13  “l. 

Pll 
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•  “23 

P  12 

* 

0 

“■-In 

P  On 

0 

Pi. 

where  L  is  unit  lower  triangular  and  U  is  ttrictly  upper  triangular.  Since  the  actual  value  of  A  is 
unimportant,  for  simplicity,  we  will  assume  from  now  on  that  A  is  a  null  vector,  hence  d(P)  =  BP 

Theorem  3.1: 

A  broadcasting  dependence  mapping  d  is  decomposable  into  elementary  propagation  using  m 
propagation  variables  if  there  exists  an  order  n  permutation  matrix  Q  such  that 


QXBQ 


L'XU  H 
0 


where  L  is  an  m  X  m  unit  lower  triangular  matrix,  U  is  an  m  X  m  strictly  upper  triangular 
matrix,  H  is  an  m  X(n  -m  )  matrix,  and  is  the  identity  matrix  of  order  n-m  . 

Proof: 

First,  note  that  a  symmetric  permutation  of  rows  and  columns  of  B  by  Q  is  equivalent  to 
reordering  the  indices  of  the  computation  n  -space.  Second,  if  the  :  th  row  of  B  is  equal  to  e*,T 
then  P,  -*  P0  =  d(P ,)  implies  that  p0,  =  p ,, ,  hence  propagation  in  the  direction  of  e,  is 
unnecessary.  Thus,  from  our  previous  analysis,  if  B  can  be  symmetrically  permuted  into  the 
stated  form,  Q  defines  an  order  of  propagation  and  the  appropriate  elementary  propagation 
may  be  carried  out  by  m  propagation  variables,  one  for  each  of  the  first  m  elementary 
directions  selected  by  the  column  permutation  Q  and  defined  with  conditional  recurrences  as  in 
(!)•□ 


Suppose  B  is  decomposable  into  elementary  propagations,  then 


BP i  -  P o  **Q  'BQ(Q-lP J  -  9_1P0  • 


Partition  vector  Q  lP  into 


Pi 

Pt 


,  where 


Pi  —  (P*,.  Pkt .  •  '  •  -  P*„,)T 
By  Theorem  3.1,  equation  (2)  is  equivalent  to 


L  lUPu  +  HP  12  ™  Poi 
P  12  ■“  P02 


(2) 


The  first  equation  implies  that  Ufn  +  Ltt$n  ”  Lp0l,  hence, 

Po*,  *  -*f,  )TP oi  +  <7^12  .  1  <  ?  <  m  ,  (3) 

where  L  *  (1,  /,  •  •  •  lm  )T,  1/  =  (if,  t?2  •  •  ■  if,,,  )T,  and  is  the  m-vector  with  all  0’s  except  for 
the  q  th  component  which  is  a  1.  Since  L  is  unit  lower  triangular,  the  expression  (7^  -  rj^  )Tp  01  is  a 
linear  function  of  the  first  q -1  components  of  Q~lP „.  Similarly,  since  U  is  strictly  upper  triangular, 
(^«TPn  +  T,rH?u)  •*  a  linear  function  of  the  last  n -q  components  of  Q'lP ,.  Hence,  the  q  th 
component,  pot>,  of  Q~XP 0  can  be  determined  from  the  invariant  components  of  the  fth  section. 
Thus,  once  the  matrices  Q  ,  L  ,  U  and  H  are  known,  the  conditionals  for  defining  the  propagation 
variables,  2*  ,  1  <  9  <  m  ,  can  be  derived  from  (1)  and  (3). 

Example  1: 

Let  the  indices  of  the  four-dimensional  computation  space  be  i2,  i3,  i4).  Consider  the 
broadcasting  dependence 


a  (*  |.*2  *3-*l)  “*  M‘2+2»3,2J2+4*3+‘4,»2+,3+*4.*2+*3+‘4).  ®r, 


0  12  0 
0  2  4  1 
0  111 
0  111 


The  broadcasting  is  decomposable  into  elementary  propagation  with  order  of  propagation 
{ Xr , ,  k«,  k3,  £4}  =  {1,  3,  2,  4},  as  defined  by  the  following  matrices  : 


1  0  0  0 

0  2  10 

10  0  0 

1  , 

„  1 

-—100 

0  0—1 

0  0  10 

2 

2 

0  10  0 

.  L  =- 

-2010 

.  u  = 

0  0  0  1 

0  0  0  1 

0  -10  1 

0  0  0  0 

For  example,  the  3rd  component  of  P0  can  be  determined  from  any  index  point  P  on  the 
second  section  of  the  propagation  path  using  equation  (3)  : 


P03  =  *?Q'lP  ~  ft  +  I' 


4  • 


Similarly, 

Poi  =*  i 2  +  2*3  ,  p  g  section  1 

P02  *  2i,  +  i4  ,  P  6  section  3 

P04  =*  »3  ,  p  €  section  4 

Thus,  for  example,  the  definition  of  the  propagation  variable  a3  is 


W) 


o3(P  +  e*,  ) 

if  (‘3+1)  < 

if  (,3-i)  > 

*2(P  +  «*,) 

if  (*3+I)  =* 

*i(P  ~  ?*,) 

^  ('3-1)  —  ■' ; 

a2(P) 

*1  +  *2 

if  *3  =  2  +  * 

which  is  valid  for  all  index  points  P  within  the  computation  domain. 

For  example,  the  dependence  a  (1,1, 1,2)  -*  b  (3, 8, 4,4)  is  now  replaced  by  a  propagation 
dependence  path  consisting  of  four  sections: 

a  (1,1, 1,2)  -  o,(l,  1,1, 2)  -  o, (2,1, 1,2)  - 

a 3(3, 1,1,2)  -  03(3,1,2,2)  -  0 3(3, 1,3, 2)  - 

a2(3, 1,4,2)  -  0 2(3, 2, 4, 2)  -  a 2(3, 3, 4, 2)  -*•••—  0 o(3,7,4,2)  — 

o  4(3, 8,4, 2)  —  0  4(3,8,4,3)  -  b  (3, 8,4, 4). 

For  the  computation  of  a  (3, 1,1, 2),  which  depends  on  the  same  value  b  (3, 8,4, 4),  the  propagation 
dependence  path  starts  as 

o(3, 1,1, 2)  —  a ,(3, 1,1.2)  -  o3(3,1,1,2) 

and  then  joins  the  path  shown  above.  It  can  be  verified  easily  that  the  definitions  of  the 
propagation  variables  do  trace  out  these  propagation  paths.  0 


Despite  the  simple  mathematical  formulation  given  in  Theorem  3.1,  determining  whether  a 
broadcasting  dependence  mapping  is  elementary  decomposable  and  finding  a  feasible  order  of 
propagation  are  non-trivial  tasks.  A  simple-minded  approach  is  to  6nd  Q  by  exhaustive  search, 
possibly  with  the  help  of  heuristics;  if  such  Q  exists  then  the  broadcasting  mapping  is  decomposable. 
This  approach  is  not  very  appealing  as  the  worst  case  complexity  is  exponential.  Fortunately,  as  will 
be  shown  next,  decomposability  of  a  broadcasting  mapping  into  elementary  propagation  is  related 
directly  to  the  non-zero  structure  io  B,  the  linear  part  of  the  broadcasting  mapping.  The  proof  of  this 
claim  provides  a  polynomial  time  algorithm  for  constructing  a  feasible  order  of  propagation. 
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For  the  following  discussion,  we  assume  that  the  tth  row  of  B  is  different  from  ?,T 

this  is  not  true,  we  can  find  a  permutation  matrix  R  such  that  R~lBR  = 
m  X  m  matrix  Bn  satisfies  the  assumption,  and  let  B  *-  Bn,  n  —  m  . 


8|1  &12 
0 

■ 


for  all  i .  If 
,  where  the 


Let  G  (fl)  »  ( V  ,E )  be  the  directed  graph,  or  the  digraph,  associated  with  the  matrix 
B—  j,  with  a  vertex  v*  6  V  for  each  of  the  n  indices  of  the  n -space  and  an  edge,  G  E , 

from  tf,-  to  Vj  if  is  non-zero.  The  connectivity  of  the  digraph  G  (B)  corresponds  to  the  non-zero 
structure  in  B.  Let  F(G  (B))  =  {  v,  j  u,  €  V,  indegreel v, )  =  0  }  be  the  set  of  free  vertices  in  the 
graph  G(B),  and  V  =  V  -  F(G{B)).  We  say  that  the  matrix  B  satisfies  the  reachability  condition  if 
its  associated  digraph  G  (B)  has  the  property  that  every  vertex  in  V  is  reachable  from  at  least  one  of 
the  free  vertices,  «.e.t  if  for  each  v,  €  V ,  there  exists  at  least  one  v,  6  F(G(B ))  such  that  there  is  a 
directed  path  from  t/f  to  Vj  in  C(£).  The  digraph  associated  with  the  broadcasting  dependence 
mapping  in  Example  1  is  shown  in  Figure  2. 

The  following  theorem  constitutes  the  main  result  of  this  subsection: 

Theorem  3.2> 

The  following  statements  are  equivalent  : 

(1)  B  satisfies  the  reachability  condition, 

(2)  there  exists  a  permutation  matrix  Q  such  that 

Q  'BQ  =  L  'U  , 

where  L  is  a  unit  lower  triangular  matrix  and  U  is  a  strictly 
upper  triangular  matrix. 


Figure  2  :  The  directed  graph.  G(B)  associated  with  the  broadcasting  dependence  B  of  Example  1 
The  vertices  are  labelled  by  the  indices  they  represent.  B  satisfies  the  reachability 
condition. 
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Prooft 

not  (1)  =»  not  (2): 

Suppose  (1)  does  not  hold  and  let  vr  €  V  be  a  vertex  that  is  not  reachable  from  the  free 
vertices;  v,  necessarily  belongs  to  a  strongly  connected  component  in  which  all  the  vertices  are 
not  reachable  from  the  free  vertices.  Without  loss  of  generality,  let  this  strongly  connected 
component  contain  the  set  of  vertices  V,l  »  {t», ,  *v+i,  },  t'.e.,  assume  that  the  matrix 

B  is  of  the  form 

[fin  0 

[fin  Sh  <Ba 

where  the  non-zero  structure  of  corresponds  to  the  connectivity  of  the  strongly  connected 
component  formed  by  the  vertices  in  V 

Suppose  (2)  holds.  Since  U  is  strictly  upper  triangular,  Q  must  be  such  that  the  diagonal  and 
the  subdiagonal  elements  of  Q'lBQ  can  be  eliminated  using  the  elements  on  its  first 
superdiagonal  as  pivots.  Consider  the  diagonal  element  0„  in  the  submatrix  B&,  there  are  three 
possible  choices  of  Q  : 

(i)  Leave  the  position  of  0„  unchanged.  Statement  (2)  does  not  hold  unless  0r  =0  for  all 
j  >  r  ,  which  contradicts  the  assumption  that  vertices  in  K,  are  strongly  connected. 

(ii)  Reorder  elements  within  S,2.  Suppose  the  j  th  row  and  j  th  column,  r  <  j  <  n  ,  are  swapped 
with  the  rth  row  and  r  th  column.  The  rth  column  becomes 

(01)  02 j  0f-l.j  0)i  0r  +  l.j  0J-I.J  0')  0J  +  1)  00)  )  ' 

which  is  equal  to 

(0  0  •  •  O0„  0,+yj  ■  ■  0,_y,  0r)  0)  +  u  ■  ■  ■  0t))T  . 

As  in  case  (t),  statement  (2)  holds  only  if  #t;  =  0  for  all  t  >  r,  which  contradicts  the 
assumption  that  the  vertices  in  V’,  are  strongly  connected. 

(n't)  Move  0„  out  of  S H.  Suppose  the  rth  row  and  rth  column  are  swapped  with  the  i th  row  and 
i  th  column,  i  <  r .  The  i  th  column  becomes 

( 0\r  02  r  0>-l.r  0rt  0i*\,r  0r -\.r  0ir  0t  +l.r  0mr  )  ■ 

which  is  equal  to 

(0  0  •  ■  0fl,r  0  0  0„  0r  +  Vt  •  ■  0„)T  . 

Since  0jr  =*  0,  j  <  i  -1,  statement  (2)  does  not  hold  unless  0}r  =  0  for  all  j ,  a  contradiction 
to  the  assumption  that  the  vertices  in  V ,  are  strongly  connected. 

Thus,  not  (1)  implies  not  (2). 
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To  prove  the  converse,  we  seed  the  following  lemma  : 

Lemma  1.1  < 

Let  F(G{B ))  be  {»,,  v2,  •••,»/}  and  V  be  [vf  v,  +2,  •■.».}.  1  <  /  <  «  If  B 
satisfies  the  reachability  condition  then  there  exists  pivot,  0h ,  1  <  i  <  J  ,  /  <  j  <  n , 
such  that  B * ,  the  order  n  -1  matrix  obtained  from  B  by 

(i)  eliminating  the  j  th  column  of  B  using  Ay  as  pivot,  and, 

(is)  deleting  the  ttb  row  and  sth  column, 
also  satisfies  the  reachability  condition. 

Proof  of  Lemma  3.1  > 

By  induction  on  the  cardinality,  /  ,  of  V ,  the  set  of  non-free  vertices  in  G(B). 

Base  case  :  /  — «  1, 

0  •  '  0  0i  /  +, 

B.  - 

0  0  0i,i*i 

0  •  -0  0/  +|  /  +i 

Since  v/ *i  is  reachable  from  at  least  one  of  the  vertices  in  F(G(B,)),  there  exists  a  0,j  * », 
1  <  »'  <  /  ,  which  is  non-xero.  Using  this  element  as  pivot,  we  have 

0  0 

BS  -  . 

0  0 

which  satisfies  the  reachability  condition.  Therefore,  the  base  case  is  true. 

Suppose  Bk ,  k  >  1,  satisfies  the  reachability  condition.  By  picking  an  appropriate  permutation 
R  of  order  (/  +fc ),  it  is  always  possible  to  rearrange  the  rows  and  columns  of  Bt  such  that  the 
principal  submatrix,  tl  obtained  from  deleting  the  last  row  and  last  column  of  R  lBR 
satisfies  the  reachability  condition.  To  do  this,  first  find  the  depth-first  spanning  forest  [9|  of  the 
graph  G  [Bk  )  and,  supposing  v, ,  f  <  u  <  f  +k  ,  is  a  non-isolated  leaf  node,  form  R  to  swap 
the  u  th  column  and  (/  +*)th  column.  For  clarity  of  presentation,  we  now  assume  that  R  is 
the  identity  matrix. 

We  complete  the  induction  step  of  the  proof  by  showing  that  either  the  element,  £r( , 
1  <  r  <  /  ,  /  <  q  <  k- 1,  which  is  a  feasible  pivot  for  the  case  l  — >  k-l,  or  the  element 
0,i  is  a  feasible  pivot  for  l  *  k  . 

Suppose  the  proposition  is  true  for  /  *  k-l,  that  is,  there  exists  a  pivot  !),,  such  that 
satisfies  the  reachability  condition.  Without  loss  of  generality,  let  0if.  /  +1  <  q  <  k-l.  be 
such  a  pivot,  then, 
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[o  "  0  02, J  +1  02,2-1  0  02.2  +  1  02,2-1 


0  "  0  0t  -U +1  "  02-1.2-1  0  02-1.2+1  02-1,2-1 


where  «■  0%i  -  ai0li,  with  at  **  2  <  i  <  (k- 1),  (/  +1)  <  j  <  (*~T)- 


Now,  if  we  pick  0Xt  as  pivot  for  Bt ,  we  have, 


lo  ••  0  02,1  +1  02,2-1  ®  02.2  +  1  "  022 


»12  : 


where  0tJ  —  0,,  -  a,0t)  with  ak  —  »'  *»  *  or  —  k.  Since,  by  assumption,  B**_, 


satisfies  the  reachability  condition,  B»  does  not  satisfy  the  reachability  condition  only  if 
0a,  ■»  0,  2  <  i  <  (fc-1),  and, 

022  7*  0, 

where  the  inequality  ensures  that  v*  does  not  belong  to  F(G(Bt*)).  Thus,  if  0Xi  is  not  a 
feasible  pivot  for  Bk  .  then 


&2-1012 

0  •  0  02,1  +1  •  02.2-1  022 


Since  vk  is  assumed  to  be  reachable  from  at  least  one  of  the  vertices  in  B(G(Bt)),  0lt  ^0. 
Picking  0x2  as  pivot,  we  have, 


0 

0 


0  •  •  0  Pi,f+i  •  ■  &k,k- 1  0 


where  0kj  —  9k, 


-0ij ,  (/  +1)  <  j  <  k ,  which  satisfies  the  reachability  condition  since 


p*  €  F(  <*(£!»*)).  Therefore,  either  0^  or  0xk  of  Bk  is  a  feasible  pivot  and  hence,  the 
proposition  is  also  true  for  /  *  k  .  D 

Proof  of  Theorem  3.2  (continued)  : 

(1)  -»  (2): 

Suppose  B  satisfies  the  reachability  condition  and  let  F(G(B))  be  {i/t,  •  •  •  ,  v,  }.  The 

following  procedure  finds  the  permutation  matrix  Q ,  a  unit  lower  triangular  matrix  L  ,  and  a 
strictly  upper  triangular  matrix  V  such  that  Q~lBQ  =  L'lU  : 

g  1;  B,  -  * 

Repeat 

{ 

v  <-  n  -  /  +1;  /i  «-  |F(C(B|)1; 

Find  a  feasible  pivot  0ti  €  .  1  <  »  <  /<  <  j  <  u  ,  (Lemma  3.1);  (t) 

/?/  «—  (t',1)  row  permutation  matrix; 

H  -  Rt  Bi  /?fT; 


H  ’  «-  LH  with  the  first  row  and  column  removed; 
0 

1  *“  0  L,R,  1  : 

ol 


^  [  0  /?,  ’ 

QiT  «—  column  permutation  which  moves  the  zero  columns  of  H’  to  the  left; 

B,+ 1  -  QtH'  QiT, 

Ol 


U-t+i  o 

-  0  Q, 

i+l; 


Q  ; 


} 

until  B/  becomes  a  null  matrix; 
V  *-  L  BQt, 

L  -  LQt; 

Q  -  «T; 


The  elimination  step  in  the  above  repeat  loop  has  to  be  done  with  exact  arithmetic  as  the 
connectivity  of  the  digraph  depends  on  the  exact  values  of  the  matrix  entries. 


By  Lemma  3.1,  if  Bt  satisfies  the  reachability  condition  then  a  feasible  pivot  can  be  found  at 
line  (t)  such  that  B<+1  also  satisfies  the  reachability  condition.  Hence,  if  B,  satisfies  the 
reachability  condition,  each  iteration  of  the  repeat  loop  eliminates  at  least  one  diagonal  element 
and  the  associated  subdiagonal  elements  and  a  strictly  upper  triangular  matrix  results  when  the 
procedure  terminates.  Thus,  (1)  implies  (2).  U 


Theorem  3.2  establishes  the  rather  surprising  property  that  the  decomposability  of  a  rank 
deficient  matrix  into  the  product  of  a  unit  lower  triangular  matrix  and  a  strictly  upper  triangular 
matrix  (up  to  symmetric  permutation)  depends  only  on  the  zero  (or  non-zero)  structure  or  the  matrix 
and  not  on  the  value  of  its  non-zero  entries.  Utilizing  this  result,  the  test  for  decomposability  of  a 
broadcasting  dependence  mapping  into  elementary  propagation  can  be  accomplished  simply  by  finding 
all  the  strongly  connected  components  in  the  digraph  G  (B)  and  verifying  that  each  strongly 
connected  component  has  at  least  one  incoming  edge  from  a  ver'ex  external  to  that  component.  The 
test  has  complexity  of  0(MAX(n  )£)))  (9).  The  procedure  detailed  in  the  proof  of  Theorem  3.2,  which 
has  complexity  of  0(n*),  can  be  used  to  find  a  feasible  order  of  propagation,  if  one  exists. 

Cxsmplu  3  (Back-substitution  solver)  t 

Given  a  non-singular  n  X  n  lower  triangular  band  matrix  A  of  bandwidth  p  ,  and  an  n  -vector 
6  ,  the  solution  x  of  the  system  of  equations 

Ax  *  b 

can  be  found  by  back-substitution.  The  algorithm  can  be  expressed  in  the  following  sequential 
program  loop  . 

for  t  =  1  to  n  do 

*,  —  (*.-£ 

)<> 

which  can  be  translated  into  the  following  LDA: 

j  x(i,  j- 1)  -  a„x(j ,  j)  i  ^  j 
\  z(,.  i-l)/a„  i  =  j 

The  domain  of  computation  is  C  =  {  l  <  «  <  n  ,  max(l,  t-p+1)  <  j  <  i  },  with  the 

boundary  condition  x(i,  j0)==  •  where  ;0=max(l.  i  -p  +1)  -  1,  and  the  final  results  are 
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fo  l] 

ting.  B  —  0  j  .  Sii 


x,  »  x  (i ,  t ).  The  dependence  of  x  (» ,  j )  on  x  (;' ,  j )  is  a  broadcasting,  B  —  |q  j  j  .  Since 

the  second  row  of  B  is  equal  to  ?*T,  we  let  B  *-  Bu  *-  (  0  j,  and  hence  G  (B)  is  a  digraph  with 
only  one  vertex  (representing  index  t ),  which  is  free.  Since  B  satisfies  the  reachability  condition, 
the  broadcasting  can  be  decomposed  into  elementary  propagation  with  one  propagation  variable, 
where 

Introducing  the  propagation  variable  x  ,  the  LDA  now  becomes 
|  x(»,  >-l)  -  o0x(«,  j)  i  yt  j 
*(*.>)—  j  x(<,  .-!)/«*  i  —  j 


I  J) 

* 1 l*  ’  J  ™  l  x  (» -I,  j  )  otherwise  ' 

In  the  recurrence  of  x  ,  only  two  conditionals  are  needed  because  the  coordinates  of  the  points  in 


Figure  3a  t  Broadcasting  dependences  in  the  back-substitution  algorithm. 


the  computation  domain  satisfy  t  >  j .  The  broadcasting  dependences  and  the  corresponding 
propagation  dependences  after  decomposition  for  the  case  p  -  6  and  n  *  8  are  shown  in 
Figure  3a  and  3b,  respectively.  D 

Although  it  is  powerful,  elementary  propagation  does  not  handle  all  of  the  broadcasting 
dependence  mappings  commonly  found  in  numerical  algorithms.  For  example,  the  broadcasting  in  the 
1-D  recursive  filtering  algorithm  shown  at  the  beginning  of  this  section  is  not  decomposable  into 
elementary  propagation.  In  the  next  subsection,  we  generalise  our  approach  to  cover  all 
broadcastings. 

3.3.  Composite  Propagation 

As  noted  in  the  previous  subsection,  constraining  the  propagation  dependences  to  elementary 
dependence  vectors  is  too  restrictive.  In  this  subsection,  we  extend  the  basic  principle  of  propagation 
to  allow  propagation  dependence  vectors  which  are  not  necessarily  elementary.  This  composite 
propagation  technique  is  a  generalization  of  the  elementary  propagation  scheme  discussed  in  the 
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Figure  3b :  Decomposition  of  broadcasting  dependences  shown  in  Figure  3a  into  elementary 
propagation  dependences. 


previous  subsection.  We  prove  that,  with  such  generalisation,  any  broadcasting  dependence  can  be 
transformed  into  propngation. 


A  major  complication  in  allowing  non -elementary  dependence  vectors  on  a  propagation  path  is 
that  the  conditionals  for  delaiag  the  propagation  variables  are  no  longer  dependent  on  the  index 
components  which  remain  invariant  in  a  section  of  the  path.  This  can  be  seen  in  the  following 
example  : 

Example  S  t 

Consider  the  broadcasting  dependence 

1  * 

1  1 

«(*».»*)-•  Bwm  j  j  . 

It  is  not  transformable  into  elementary  propagation  as  B  does  not  satisfy  the  reachability 
condition.  By  using  the  non-elementary  propagation  dependence  vectors,  w ,  «  ( 1 ,  - 1  )T  and 
S2  «  (1,  -2)t,  however,  this  broadcasting  mapping  can  be  transformed  into  propagation  : 

4  (*  it  *2)  -*  Ii(*  i»  *2)  . 
where, 


W)- 


I,(P  +  a,  )  if  (i,  +  2i,  -  1)  >  0 
I,(P  -  «5,)  if  (»,  +  2*2  +  l)  <  0 
JJP  +  «>,)  if(»i  +  2», -1)  — 0 
if  («,  +  2i2  +  1)  -0 
W)  if(.,  +  2.2)-0 


W) 


a2(P  +  ti>2  )  if  (t,  -  i2  +  3)  <  0 
I2(P  ~  ®2>  if  («i  -  *2  -  3)  >  0 
b(P  +  ®2)  if  (»i  -  »2  +  3)  =*  0 
b(P-SS2)  if  (•  1  -  »2  -  3)  *  0 
*(P)  if  (*1  -  *2)  **  0 


Some  of  the  broadcasting  dependences  and  the  corresponding  propagation  dependence  paths  are 
shown  in  Figure  4a  and  4b.  Interested  readers  can  verify  that  the  propagation  scheme  does 
indeed  implement  the  broadcasting  correctly.  G 

Unlike  elementary  propagation,  in  which  the  index  points  along  a  section  share  some  common 
invariant  components  that  can  be  used  for  determining  the  sense  and  extent  of  the  section,  composite 
propagation  does  not  necessarily  preserve  any  of  the  components  of  index  points  on  a  section. 
Consequently,  the  technique  outlined  in  the  previous  subsection  for  defining  the  appropriate 
propagation  variables  does  not  apply  directly.  By  restricting  our  attention  to  particular  classes  of 
propagation  dependence  vectors,  however,  we  are  able  to  derive  a  precise  mathematical  description  of 
composite  propagation.  This  is  detailed  in  the  following. 


Mathematical  formulation 

Let  P  be  an  index  point  in  the  computation  domain.  C.  Originally,  P  is  expressed  as  a  linear 
combination  of  the  n  elementary  vectors,  {?* },  which  make  up  the  canonical  basis, 

P  —  +  P  2*2 

Suppose  we  perform  a  change  oj  basis  (10|  to  C  from  the  canonical  basis  to  the  basis  defined  by  the  n 
linearly  independent  integral  vectors,  wt,  S2,  •  ■  •  ,  wn  ,  which  form  the  columns  of  a  matrix  IK 
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Figure  4a  «  Some  of  the  broadcasting  dependences  of  Example  3.  The  broadcast  data  are  located 
along  the  line  i  |  ■■  » 2. 


With  respect  to  this  new  basis,  the  index  point  P  becomes  P  where 

P  -*  f  ,3,  +  pjSfj  +  •  •  •  +  p„  55,  =*  P  —  W'lP  , 

that  is,  a  change  of  basis  from  {?, }  to  the  columns  of  W corresponds  to  a  linear  mapping,  W~x,  which 
maps  index  points  P  6  C  to  index  points  P  of  the  transformed  computation  domain,  C.  To  ensure 
that  C  contains  only  integral  points,  W  must  be  unimodutar,  i.e.,  |  det(W')  |  =  1. 

Let  d  =*  B  be  a  broadcasting  dependence  mapping  in  C.  Under  the  change  of  basis  to  IV,  the 
broadcasting  dependence  P  — »  BP  becomes 

W  'P  -  WlBP  m  P  -  W  lBW{WlP)  -  W  'BWP 

in  C.  Thus,  the  broadcasting  dependence  B  undergoes  a  similarity  transformation  {10},  B  *  W'BIV. 
Since  9  represents  also  a  broadcasting  dependence  in  C,  if  B  is  elementary  transformable,  the 
elementary  propagation  paths  of  5  in  C  can  be  mapped  by  W  into  propagation  paths  in  C  that 
implement  the  broadcasting  dependence  B.  This  is  formally  stated  in  the  following  theorem. 

Theorem  3.3  t 

A  broadcasting  dependence  B  is  decomposable  into  composite  propagation  with  propagation 
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Figure  4b :  Decomposing  broadcasting  dependences  shown  in  Figure  4a  into  propagation 
dependences  with  dependence  vectors  (1  -1)T  and  (I  -2)T. 


dependence  vectors  0,,  02,  ■  •  •  ,  Sm ,  if  there  exists  an  n  X  n  unimodular  integral  matrix 
W  ast  (0,,  •  •  •  ,  0,  ),  called  a  feasible  basis,  such  that 


W  lBW  -  H 


L~lU  H 

0  l„-m 


where  L  is  an  m  X  m  unit  lower  triangular  matrix,  V  is  an  m  X  m  strictly  upper  triangular 
matrix,  H  is  an  mX(n-m)  matrix,  and  /„_m  is  the  identity  matrix  of  order  n  -  m  . 

Proof: 

By  Theorem  3.1,  the  broadcasting  dependence  B  can  be  decomposed  into  elementary 
propagation  with  propagation  variables,  a< ,  defined  with  respect  to  the  transformed  domain. 
P  €  C, 
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*  !.♦  >,*  i  'iWVWiW 


(4) 


*(?)- 


*(*  +  *)  if (p  +  <?(  k,  <  p 0i 

*i(P  -  «,)  if  (Z5  -  ?,  H.  >  p„, 

T<+,(P  +  *. )  if  (P  +  t,  If  -  Poi  ,  VP6C,i<m, 
T,+i(P  -?,)  if  (P  -  ?.  -  Po. 

?i+1(£)  if  (P  )li  -p«. 


where  ?,  is  the  ith  canonical  vector  (in  the  transformed  domain)  and  p0.  •*  the  itb  component 
of  the  index  point  P0  at  which  the  broadcast  quantity  is  located.  Using  equation  (3),  p0,  can  be 
expressed  as  the  invariant  component  of  the  i  th  section  of  the  elementary  propagation  path  in 
C.  Applying  the  linear  transformation  W  to  the  domain  C,  and  substituting  W~lP  and  uJ,  for 
P  and  ?j ,  respectively,  in  the  conditionals,  expression  (4)  becomes 


*i(P) 


a, 

(P  +  ) 

it(WlP  + 

8, 

H, 

<  Po. 

a, 

( P  -©,  ) 

if  ( W'lP  - 

H, 

>  Poi 

ai 

+i(P  +  ) 

if  ( W'lP  + 

S, 

H. 

“  POi 

«. 

+JP  ~  ®, ) 

if  ( - 

Hi 

*“  Po. 

a. 

+i  (P) 

if  (W'lP  Hi 

- 

Po. 

Po,  ,  V  P  6  C  ,  »  < 


m 


(5) 


where  poi  is  obtained  from  p0, ,  which  is  a  linear  function  of  p0j .  j  ^  » ,  with  the  appropriate 
substitutions.  Thus,  the  broadcasting  dependence  B  can  be  transformed  into  composite 
propagation  with  propagation  dependence  vectors  ,  l  <  »'  <  m  .  Obviously,  the  elementary 
propagation  scheme  discussed  in  the  previous  subsection  is  a  particular  case  of  composite 
propagation  in  which  the  feasible  basis,  W,  is  an  order  n  permutation  matrix.  Q  .  D 


Example  4  < 

Apply  Theorem  3.3  to  Example  3: 


Let  IK  — 

1  1 

-l  -2 

,  then 

3=  * 

0  -3 

0  2 

-  L  ~l 

V  - 

‘  1  0 

0 

--  1 

3 

Hence  B  is  decomposable  into  elementary  propagation  with  propagation  variables 


i 


fiiP) 


i 

i 


ai (P  +  ?  i  )  if  (P  +  «i  Hi  <  Spi 
a,(P  -  ?,)  if  (P  -  Hi  >  -3 p2 
*2 {P  +  «i)  if  {P  +  t\  Hi  *  "3P2  , 
a2(P  -  ?,)  if  [P  -3?! 

a2(P)  if  (PHi  *  -3p2 
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HP) 


H?  +  {P  +  Is  <  — yP  1 
HP  -H  it  IP  -?2|*>-jFi 
4(P+?*)  if(P  +?*|2-  -|?i 
MP-?,)  if(P  -?2)l2--|?. 

MP)  if(P<2--|?1 


Some  of  the  elementary  propagation  paths  in  the  transformed  domain  are  displayed  in  Figure 
4c.  Applying  the  linear  transformation  W  to  the  domain  C  and  performing  the  substitutions 
W'lP  »  (pIt  y2)T  »  ((2s t  +  »2),  — (»* i  +  »2))T, 

(W  'P  ±  »,|j  —  (p i  +  w„)  —  2 +  «'j  ±  1. 

(JV-’Pl,  -  -  2s* ,  +  .* 

(IF*lP  ±  a2|j  —  (jF2  +  W„)  rn>  -(»,  +  t2  ±  2),  and 

in  the  conditionals  in  (5),  we  get  the  propagation  variable  definitions  given  in  Example  3. 


The  choice  of  IF  is  non-unique.  For  example,  by  selecting  W 
composite  propagation  paths  is  obtained  (Figure  4d).  Q 


1  0 
-1  1 J ' 


a  different  set  of 


Existence  of  s  feasible  basis 

From  Theorem  3.2,  a  broadcasting  dependence  B  is  decomposable  into  elementary  propagation 
if  and  only  if  B  satisfies  the  reachability  condition.  The  following  corollary  of  Theorem  3.3  is  a  trivial 
extension  of  Theorem  3.2  : 

Corollary  3.1  i 

A  broadcasting  dependence  B  is  decomposable  into  composite  propagation  if  and  only  if  there 
exists  a  unimodular  transformation  IF  such  that  B  =  IF'BIF  satisfies  the  reachability 
condition. 

Hence,  given  B,  theoretically  one  could  determine  all  the  feasible  bases  by  checking  the  non-zero 
structure  of  B  for  all  unimodular  transformations  IF.  The  next  theorem  is  the  major  contribution  of 
this  subsection. 

Theorem  3.4  i 

All  broadcasting  dependences  can  be  decomposed  into  composite  propagations. 

We  prove  the  theorem  by  showing  that  for  any  rank  deficient  matrix,  B,  there  exists  at  least  one 
unimodular  transformation  IF  such  that  B  —  W~XBW  sac  fies  the  reachability  condition.  For  the 
trivial  case  where  B  ™  0,  Theorem  3.4  is  automatically  satisfied  as  B  is  a  null  matrix  for  all 
unimodular  transformations  IF.  In  the  following  discussion,  we  assume  that  B  0.  To  prove  the 
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Figure  4c  s  Elementary  propagation  dependences  obtained  if  a  change  of  basis.  IE.  is  applied  to  the 
domain  shown  in  Figure  lb.  For  clarity,  the  self-dependences  are  not  displayed. 


main  result,  we  need  the  following  lemma: 

Lemma  2.1 1 

Given  a  rank  deficient  matrix  B  0,  there  exist  integral  n -vectors  0,  and  $ ,  satisfying 
00,  -  0. 
fT0,  -  1.  and 
fTB  kfT,  for  any  scalar  k  . 

Proof  of  Lemma  2J  t 

Since  B  is  rank  delcieat  there  always  exists  an  integral  vector  with  coprime  components.  0,, 
seek  that  00,  ■■  0  For  any  such  0,.  there  exists  n  linearly  independent  integral  vectors,  q , , 


1  <  i  <  n  ,  which  satisfy  f,Tuji  —  1.  If  for  each  of  these  vectors  there  exists  a  scalar  k,  such 
that 

then 

?< TBu}t  =  i,?,1®,, 

hence  i,  =  0.  Therefore,  qjB=0  tot  n  linearly  independent  vectors  which  implies  that 
B  «■>  0,  a  contradiction.  G 

Let  ^  and  55 ,  be  integral  vectors  that  satisfy  the  conditions  stated  in  Lemma  3.2.  Let 
7  =  flTy.  Since  7T  ^  jfc^T  by  assumption,  the  components  of  7  and  ^  can  be  reordered  with  the 
same  permutation  such  that 

»1<I  2  7*  *201  •  (C) 

Without  loss  of  generality,  we  assume  that  no  reordering  is  required. 

We  prove  Theorem  3.4  by  constructing  the  integral  vectors  55, ,  2  <  i  <  n  ,  such  that 
(*)  the  first  row  of  IF-1  is  ?T, 

( it)  the  matrix  IF  =  (3,,  55  2,  •  ••  ,  55, )  is  unimodular,  and, 

( m)  (fTBu5,  =  7T55,  0,  for  all  i  >  2, 

where  ?  and  55,  satisfy  Lemma  3.2  and  condition  (c).  For  such  a  matrix  W ,  B  —  W’XBW  has  the 
form 

0  x  •  x 

B  —  ,  x  ^  0  , 

0 

which  always  satisfies  the  reachability  condition,  hence  IF  is  a  feasible  basis. 

Proof  of  Theorem  3.4  : 

Requirements  (i  )  and  (it)  are  equivalent  to 
(/V)  ~qT  =0,i  >2.  and 
(«’)  det(?  «•-;  •  •  S', )  =  |  ?  |2. 

This  follows  directly  from  the  fact  that  the  first  row  of  IF"1  is  uT/det(lF),  where  u  is  the 
column  vector  whose  i  th  component  is  the  cofactor  of  the  (t.  I)  element  of  IF,  and  from  the 
formulas 

det(  IF  )  =  u  T55 ,, 

det(^  w2  ■  ■  ■  55,  )  =  tfT^, 

obtained  by  expanding  the  two  determinants  with  respect  to  their  first  column 

We  now  construct,  given  q  and  55,  satisfying  the  conditions  of  Lemma  3.2  and  condition  (c), 
vectors  u52,  u53,  •  •  ■  ,  55,  ,  which  satisfy  condition  (in)  to  (t>).  The  associated  matrix  IF  is  a 
feasible  basis, 
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Figure  4d  *  Another  feasible  composite  propagation  scheme  for  the  broadcasting  dependences  shown 
in  Figure  4a.  Note  that  the  propagation  paths  are  shorter  than  those  shown  in  Figure 
4b. 


Let  g, .  1  <  i  <  n  ,  be  the  gcd  of  the  first  i  components  of  ?.  The  g,  s  can  be  computed  via 
the  recurrence 


<7 1  =*  ?!• 

g,  —  gcdfj,.,,  q,),2  <  i  <  n , 


hence.  —  and  are  integers  for  i  >  2. 
9,  9< 


The  set  of  integral  vectors,  { tl?2>  C3l  •  •  •  ,  wa  },  where 


S»l  -1  T 

w,  =*  (u/,1  w,o  ■  ■  ■  w,  i  _|  - 0  •  •  0)T  ,  2  <  i  <  n  , 

Si 


>-i 

E  wo  i)  -  — —  • and' 

i =i  9< 


£  ft -1 

E  *ii»i  *  — *■  • 
,  -1  *• 


(<*) 


satisfies  requirements  ( *'**)  to  (v).  This  is  shown  next. 

The  Diophantine  equation  (c/)  always  has  solutions,  because  the  gcd  of  the  coefficients  on  its 
left  hand  side,  which  is  g, _|.  divides  its  right  hand  side  |U|.  Since,  by  assumption  (c),  the  two 
hyperplanes  defined  by  equation  (cl)  and  equation 


•  -i 


Eft  -l 

Wi]  t}  - 

r«i  ti 

are  not  parallel,  there  exists  at  least  one  solution,  {tu,r  j  <  i },  to  equation  ( c7 )  that  satisfies 
condition  (cl?).  Therefore,  the  set  {®2,  uJ3,  ■  •  •  ,  Wt }  is  well  defined. 

It  is  straightforward  to  verify  that  condition  (cl)  implies  (tv)  and  condition  (cS)  implies  (tit). 
Thus  the  vectors  w, ,  i  >  2,  satisfy  requirements  (iit)  and  ( iv). 

To  see  that  uS2,  u53,  •  •  •  ,  0,  satisfy  requirement  (v),  consider  the  matrix 
V  -(?.  ©,,  .  ©.), 

which  has  the  following  structure 

?  1  w2l  tt'31  wit  ■  ■ 

<J  2  w  22  ^32 


v 


?3  0 


“>33 

0  wu 
0 


w.i 

«0»2 

W.3 


L  In  0 

Partition  the  matrix  V  as 


0 

0  0 


w. 


V  = 


1 1 


rT 


(6) 


where  rT  is  a  row  (n  -l)-vector,  c  is  a  column  (n  -l)-vector  and  V  is  upper  triangular. 

Using  Schurs  determinantal  formula  (12),  the  determinant  of  V  can  be  expressed  as 
det(K)-(?,-rTK,c)det(r). 

Since,  by  construction,  qrw,  »  0,  2  <  i  <  n  , 

? ,r T  +  cT  U  =  0  , 

which  yields  the  relation  r’TU'1  ■« - .  Substituting  this  relation  into  equation  (6)  and 


U 


noting  that 
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det(  V )  =  =  jj  Wfj  =  ?i, 
i*2 

the  determinant  of  V  is  seen  to  satisfy 
det(  V )  -  ?12  +  7Tc  «  , 

which  is  equivalent  to  condition  (t>). 

Therefore,  the  vectors  W, ,  i  >  2,  satisfy  the  requirements  (tit)  to  (t>)  and  hence  satisfy  (t)  to 

(»*)■  □ 

Given  a  broadcasting  dependence,  the  basis  W  constructed  in  the  proof  of  Theorem  3.4  can  be 
used  to  transform  the  broadcasting  dependences  into  composite  propagation  dependences.  This  is, 
however,  not  the  only  feasible  choice  of  basis  in  general.  It  is  unclear  whether  there  exist  efficient 
methods  for  determining  all  such  feasible  choices.  In  the  concluding  section,  we  will  discuss  current 
and  future  research  topics  in  the  area. 


4.  INPUT  PIPELINING 

Input  values  of  an  algorithm,  which  are  used  by  more  than  one  computation,  should  be 
pipelined  to  reduce  the  amount  of  communication  between  the  host  and  the  processor  array  on  which 
the  algorithm  is  implemented.  To  avoid  performance  degradation  due  to  I/O  bottleneck,  it  is 
essential  to  reduce  the  I/O  bandwidth  requirement  of  the  algorithm  so  that  the  host  is  capable  of 
delivering  the  input  data  at  a  rate  which  matches  the  computation  rate  of  the  array. 

Example  S  s 

Consider  the  one-dimensional  convolution  of  signal  x  and  tv  to  form  the  filtered  signal  y  as 
given  by  the  following  LDA  consisting  of  a  single  recurrence, 

»(*',/)—  »(*.  j- 1)  +  »(/)•  *(*'-;') . 

with  computation  domain  {  1  <  *  <  n ,  1  <  ;  <  i }.  Thus,  B„  =  l2,  Bfm  =  |0  l]  and 
B„  -  |1  -1|. 

The  input  streams  w  and  x ,  which  are  needed  for  the  computations  of  multiple  instances  of  y  . 
should  be  pipelined  to  reduce  the  I/O  overhead.  This  can  be  realized  by  transforming  the  LDA 
into  its  equivalent  input-pipelined  form  with  the  introduction  of  the  pipelined  variables,  w  and 
x, 

»(». ;')  *=  v(*.  j'-i)  +  ®(».  j)  ■  ?(»'- ;' ) 

I J)  *  >  I 

w\x')>x  I  w(j )  otherwise 
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y-i)  i.j  >  i 

1 l*(»~y)  otherwise' 

With  this  transformation,  only  one  copy  of  each  of  the  streams  x  and  w  is  needed  from  the 

host.  The  dependence  paths  for  pipelining  the  input  variables  are  shown  in  Figure  5.  D 

In  this  section,  we  show  that  input  pipelining  can  be  treated  as  a  special  case  of  the 
broadcasting  presented  in  the  previous  section  and,  therefore,  that  it  can  be  handled  with  the  same 
technique. 

A  variable  at  is  an  input  stream  if  it  is  not  computed  within  the  LDA,  i.e.  ,  if  /  >  r ,  the 
number  of  recurrences  in  the  LDA.  The  following  defines  the  pipelinability  of  an  input  stream  : 
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Figure  S  :  Pipelining  of  input  streams  w  and  x  in  Example  5  (n  =8).  The  self-dependences  are 
not  displayed. 


Definitions 

An  input  stream  a,  of  an  LDA  is  pipelinable  if  there  exist  one  or  more  dependence  mappings, 

da ,  1  <  i  <  r  ,  in  the  r  recurrences  of  the  LDA  such  that  rank (Bu  )  <  n  . 

Obviously,  if  all  the  Bn's  in  the  LDA  have  rank  equal  to  n,  the  computation  of  a,  at  each  index 
point  P  requires  a  distinct  value  of  a, ,  a,  (du(P )).  In  this  case,  each  of  the  input  values  of  a,  is  only 
used  once  and  therefore  the  stream  is  not  pipelinable. 

Now,  suppose  there  exists  a  Bu  of  size  m  X  n ,  m  <  n  ,  with  rank  less  than  n  .  This  resembles 
broadcasting  in  which  a  particular  (computed)  value  is  needed  by  several  other  computations.  The 
only  difference  is  that,  since  at  is  an  input  variable,  we  have  the  freedom  to  reassign  the  location  of 
the  broadcast  value  provided  that  the  assignment  does  not  result  in  conflicts,  that  is,  we  do  not  assign 
more  than  one  au  to  an  index  point. 

Let  R,  be  an  n  X  m  integral  matrix,  the  relocation  transf  ormation  of  input  variable  at .  R, 
has  to  be  choeen  such  that 

RiBu(P t  -  P2)  —  0  only  if  B,/(Pi  -  P2)  =»  0  , 


that  is,  R/  should  have  full  column  rank.  Applying  /?<  to  dt, ,  we  have 

d«(P)  “  BaP  -  Am  =  R/BjiP  -  Ri&u  . 


Since  Bn  is  square  and  rank  deficient,  it  can  be  treated  as  a  broadcasting  dependence  mapping  and 
the  technique  discussed  in  the  previous  section  applies. 


In  Example  5,  the  chosen  relocation  transformations  for  the  input  variables  w  and  z  are 
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By  decomposing  these  two  broadcastings  respectively  into  elementary  propagation  and  composite 

[i  il 

j  q  ),  we  obtain  the  result  given  in  Example  5. 


Example  0  i 

The  matrix-matrix  multiplication  algorithm  is  an  LDA  comprising  one  recurrence  equation 
c  (»,;,*)  -  c  (i  ,j  ,k  -1)  +  o  (»',*)  •  b(k,j)  ,  1  <  <  N  , 

in  which  the  input  streams  a  and  b  are  pipelinable.  Applying  the  procedure  described  above, 
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and  transforming  the  broadcasting  dependences  into  propagation,  the  input-pipelined  version  of 
the  LDA  is 
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S.  CONCLUDING  REMARKS 

In  this  paper,  we  systematize  the  decomposition  of  broadcasting  dependences  into  propagation 
dependences.  Two  kinds  of  propagation  schemes,  elementary  and  composite  propagation,  are 
introduced.  It  is  shown  that  the  derivation  of  the  appropriate  decomposition  can  be  formulated  as  a 
linear  algebra  problem.  Moreover,  all  broadcasting  dependences  are  decomposable  into  propagation 
dependences. 

In  the  discussion,  we  provided  the  theoretical  framework  for  performing  the  decomposition  but 
avoided  the  actual  implementation  issues  because  the  optimal  choice  of  propagation  scheme  depends 
on  many  factors;  among  others, 

(1)  the  propagation  dependence  vectors  must  be  selected  in  a  way  compatible  with  the  other 
data  dependences  in  the  algorithm  for  the  transformed  algorithms  to  be  executed 
efficiently, 

(2)  the  number  of  propagation  variables  should  be  minimized  to  simplify  the  data  flow  and 
the  control  complexity  of  the  processor  array,  and, 

(3)  the  length  of  the  propagation  paths  should  be  minimized  as  the  computation  time  of  the 
algorithm  depends  on  the  length  of  the  longest  dependence  path  in  the  computation 
domain. 

Simple  heuristics  can  be  found' for  the  optimal  choice  of  propagation  scheme.  These  results  will  be 
reported  in  a  forthcoming  paper. 

Open  problems  and  related  research  currently  in  progress  include  : 

(1)  Can  efficient  methods  for  determining  all  the  feasible  propagation  schemes  be  found? 
This  is  equivalent  to  finding  all  the  feasible  bases  IV,  such  that  the  matrix  W~XB\V 
satisfies  the  reachability  condition,  a  condition  which  depends  on  the  non-zero  structure 
of  the  matrix. 

(2)  Is  it  possible  to  classify  and  characterize  all  such  feasible  bases? 

(3)  Does  there  always  exist  a  propagation  scheme  such  that  the  transformed  algorithm  has 
the  same  order  of  run  time  as  the  original  algorithm?  In  [4],  it  is  shown  that  any  systolic 
algorithm  with  computation  domain  of  dimension  s  has  0(N )  run  time  when 
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implemented  on  an  (*  -1  dimensional  systolic  array,  where  N  is  the  size  parameter  of  the 
algorithm.  We  would  like  to  know  whether  such  a  claim  can  be  extended  to  LDA  type 
algorithms 

(4)  How  is  the  control  and  communication  complexity  of  the  array  related  to  the  propagation 
scheme  used?  We  would  like  to  justify  the  cost  effectiveness  of  implementing  algorithms 
with  broadcasting  dependences  on  regularly  connected  VLSI  arrays. 

The  results  reported  in  this  paper  will  serve  as  stepping  stones  for  further  investigation  of  these 
topics. 
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